Lait Equitable: A Data-Driven Approach to Fair Trade Milk

Authors

Jayesh Smith, Emeline Raimon-Dacunha-Castelle, and Urs Hurni

Published

May 10, 2024

Abstract

This project analyzes the Lait Equitable dataset, which contains information on the production of fair trade milk in Switzerland. The goal is to identify trends and patterns that help us better understand this production. We use exploratory data analysis, data visualization, and statistical modeling to analyze the dataset and draw conclusions.

1 Introduction

1.1 Company Background

Lait Equitable, also known as Fair Milk, is a transformative Swiss initiative designed to ensure fair compensation for dairy producers across Switzerland. Founded in response to a prolonged crisis in the dairy industry, Lait Equitable aims to address the substantial disparity between the cost of milk production and the compensation received by producers. Historically, the plummeting prices of milk have often failed to cover production costs, leading to financial stress among producers. The cooperative structure of Lait Equitable ensures that each liter of milk is sold for 1 CHF, aligning closely with the actual production cost of approximately 98 cents per liter as estimated by Agridea in 2016.

Key Points:

  • Foundation: Lait Equitable was established to counteract the severe financial difficulties faced by Swiss milk producers due to inadequate compensation for their products. This initiative emerged as a critical response to the drastic reduction in the number of operational dairy farms, which saw a decline from about 44,000 producers 25 years ago to 17,164 by the end of 2023.
  • Operation: The initiative operates through the “Lait Equitable” cooperative. This entity is responsible for collecting, processing, and distributing milk, ensuring that producers receive fair compensation. It is noteworthy that while the cooperative guarantees 1 CHF per liter, the actual market price often falls significantly below this, making the cooperative’s top-up essential for fair compensation.
  • Product Range: Under the brand Faireswiss, Lait Equitable offers a range of dairy products available at various retail outlets. While the milk in Faireswiss products may not always directly come from the cooperative members due to logistical reasons, the fair compensation model remains intact.
  • Membership: The cooperative is inclusive, welcoming all Swiss milk producers, including those involved in specialty products like Gruyère cheese. While the cooperative initially included cheese producers, decisions made in recent general assemblies have temporarily restricted this inclusion, subject to future reassessment.

This initiative not only supports the livelihood of local farmers but also forms part of a broader European movement towards fair milk pricing, coordinated by the European Milk Board (EMB). Lait Equitable serves as a model for sustainable agricultural practices in the dairy industry, so that producers can make a dignified living from their work.

Source - faireswiss.ch

1.2 Research questions

  1. Sales Analysis Across Manors: To determine the factors contributing to the success of Lait Equitable’s milk in some Manor stores but not in others.
  2. Price Differences: To analyze the price differences between conventional and organic milk and their impact on market demand and supply.
  3. Demand for Organic Milk: To study how the demand for organic milk has evolved in recent years and identify the driving factors behind this trend.

2 Data

  • Sources
  • Description
  • Wrangling/cleaning
  • Spotting mistakes and missing data (could be part of EDA too)
  • Listing anomalies and outliers (could be part of EDA too)

2.1 Swiss Producer

TODO: IMPLEMENT EMELINE'S PART

2.1.1 Wrangling and Cleaning

Click to show code
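library(data.table)
library(readxl)      # read_excel()
library(dplyr)       # %>% and the data manipulation verbs
library(tibble)      # tibble()
library(kableExtra)  # kbl(), kable_styling()

file_path <- "../data/"

# Read the first sheet of the producer price workbook
df_producteur <- read_excel(paste0(file_path, "lait_cru_producteur.xlsx"), sheet = 1)

# Make sure the date column is stored as a Date
df_producteur$date <- as.Date(df_producteur$date)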
# Create a tibble with variable descriptions for df_producteur
variable_table <- tibble(
  Variable = c("Date", "prix_bio", "prix_non_bio", "delta", "Delta_pourcent"),
  Description = c(
    "The date when the prices were recorded, in a year-month-day format.",
    "The recorded price of organic milk on the given date.",
    "The recorded price of non-organic milk on the given date.",
    "The absolute difference between the organic and non-organic milk prices.",
    "The percentage difference between the organic and non-organic milk prices."
  )
)

# Display the table using kableExtra
variable_table %>%
  kbl() %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))
Variable        Description
Date            The date when the prices were recorded, in a year-month-day format.
prix_bio        The recorded price of organic milk on the given date.
prix_non_bio    The recorded price of non-organic milk on the given date.
delta           The absolute difference between the organic and non-organic milk prices.
Delta_pourcent  The percentage difference between the organic and non-organic milk prices.
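
The two derived columns can be illustrated with a small worked example. The sketch below is written in plain Python rather than the R used in this section. The function name `price_gap` is hypothetical, and the prices are made-up values chosen only to keep the arithmetic easy to follow; they are not taken from the dataset.

```python
def price_gap(prix_bio: float, prix_non_bio: float) -> tuple[float, float]:
    """Return (delta, delta_pourcent) for one price observation.

    delta          = prix_bio - prix_non_bio
    delta_pourcent = delta / prix_non_bio * 100
    """
    if prix_non_bio == 0:
        raise ValueError("prix_non_bio must be non-zero to compute a percentage")
    delta = prix_bio - prix_non_bio
    delta_pourcent = 100 * delta / prix_non_bio
    # Round to 2 decimal places, as the display table does
    return round(delta, 2), round(delta_pourcent, 2)


# Illustrative prices only (assumption, not real data)
delta, delta_pourcent = price_gap(80.0, 64.0)
print(delta, delta_pourcent)  # prints: 16.0 25.0
```

Note that the percentage is expressed relative to the non-organic price, so the same absolute gap gives a larger percentage when conventional prices are low.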

2.1.2 Description

This dataset tracks the price of milk paid to producers over time. It is essential for a macro-level analysis of the milk industry in Switzerland.

Here is a preview:

Click to show code
# Build a display copy with the price gap recomputed from the two price columns
df_producteur_show <- df_producteur %>%
  mutate(delta = prix_bio - prix_non_bio,
         delta_pourcent = (prix_bio - prix_non_bio) / prix_non_bio * 100) %>%
  select(date, prix_bio, prix_non_bio, delta, delta_pourcent) %>%
  # Round all numeric columns to 2 decimal places
  mutate_if(is.numeric, round, 2)

#print max and min values for delta_pourcent
# max_delta_pourcent <- max(df_producteur_show$delta_pourcent, na.rm = TRUE)
# max_delta_pourcent
# min_delta_pourcent <- min(df_producteur_show$delta_pourcent, na.rm = TRUE)
# min_delta_pourcent

#display cleaned data using reactable
library(reactable)
reactable(
  df_producteur_show,  
  highlight = TRUE,  # Highlight rows on hover
  defaultPageSize = 10,  # Display 10 rows per page
  paginationType = "numbers",  # Use numbers for page navigation
  searchable = TRUE,  # Make the table searchable
  sortable = TRUE,  # Allow sorting
  resizable = TRUE  # Allow column resizing
)


2.2 Lait Equitable Sales

2.2.1 Dataset Sales 2023

2.2.1.1 Wrangling and Cleaning

Click to show code
import pandas as pd

file_path = '../data/all_products_sales_per_stores_2023.xlsx'
df = pd.read_excel(file_path, sheet_name='Sheet1')

df.columns = df.columns.astype(str)
# Renaming the monthly columns for easier readability
new_column_names = {
    '2023-01-01 00:00:00': 'Jan 2023',
    '2023-02-01 00:00:00': 'Feb 2023',
    '2023-03-01 00:00:00': 'Mar 2023',
    '2023-04-01 00:00:00': 'Apr 2023',
    '2023-05-01 00:00:00': 'May 2023',
    '2023-06-01 00:00:00': 'Jun 2023',
    '2023-07-01 00:00:00': 'Jul 2023',
    '2023-08-01 00:00:00': 'Aug 2023',
    '2023-09-01 00:00:00': 'Sep 2023',
    '2023-10-01 00:00:00': 'Oct 2023',
    '2023-11-01 00:00:00': 'Nov 2023',
    '2023-12-01 00:00:00': 'Dec 2023'
}
df.rename(columns=new_column_names, inplace=True)

# Standardize city names based on the mapping provided
correct_city_names = {
    'Bâle': 'Basel',
    'Genève': 'Geneva',
    'Bienne': 'Biel/Bienne',
    'Chavannes': 'Chavannes-de-Bogis',
    'Marin': 'Marin-Epagnier',
    'Vesenaz': 'Vésenaz',
    'Yverdon': 'Yverdon-les-Bains',
    'Saint-Gall Webersbleiche': 'St. Gall'
}
df['Row Labels'] = df['Row Labels'].apply(lambda x: correct_city_names.get(x, x))

The 2023 dataset was cleaned and standardized before analysis. We loaded the data from an Excel file and renamed the monthly columns to short, month-specific labels (for example, ‘Jan 2023’) for readability. We also mapped variant city names to a standardized form (e.g., ‘Bâle’ to ‘Basel’) so that location names stay consistent across datasets.
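
The twelve-entry renaming dictionary can also be generated rather than typed by hand, which avoids copy-paste slips. The sketch below is one way to do it, assuming the raw headers always have the form `YYYY-MM-01 00:00:00` as in the code above; the helper name `month_column_names` is hypothetical.

```python
from datetime import date


def month_column_names(year: int) -> dict[str, str]:
    """Map raw Excel-style headers to short month labels for one year."""
    mapping = {}
    for month in range(1, 13):
        first_day = date(year, month, 1)
        raw_header = f"{first_day.isoformat()} 00:00:00"   # e.g. '2023-01-01 00:00:00'
        # %b is locale-dependent; under Python's default "C" locale it is English
        mapping[raw_header] = first_day.strftime("%b %Y")  # e.g. 'Jan 2023'
    return mapping


new_column_names = month_column_names(2023)
```

The resulting dictionary can be passed to `df.rename(columns=new_column_names)` in the same way as the hand-written one.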

2.2.1.2 Description

The dataset is small: it contains monthly sales per store for 2023. It is nevertheless essential for analyzing Lait Equitable sales across the different Manor stores.

Here is a preview of the cleaned and structured data:

Click to show code
#load python df
df_sales_2023 <- py$df

# Load necessary libraries
library(tibble)
library(kableExtra)

# Create a tibble with variable descriptions for df_manor_sales
variable_table <- tibble(
  Variable = c("Row Labels", "Monthly Columns (2023-01-01 to 2023-12-01)", "Grand Total"),
  Description = c(
    "Identifies the Manor store location by name.",
    "Each column represents sales figures for a specific month of 2023",
    "Total sales across all months of 2023 for each location"
  )
)

# Display the table using kableExtra
variable_table %>%
  kbl() %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))
Variable                                    Description
Row Labels                                  Identifies the Manor store location by name.
Monthly Columns (2023-01-01 to 2023-12-01)  Each column represents sales figures for a specific month of 2023
Grand Total                                 Total sales across all months of 2023 for each location
Click to show code

# Using the provided column names correctly in the dataframe df_sales_2023
df_sales_2023_show <- df_sales_2023 %>%
  # Ensure you convert the column names to standard ones if needed
  rename(Location = `Row Labels`) %>%
  # Correctly sum the monthly sales columns from Jan 2023 to Dec 2023
  mutate(Total_Sales = rowSums(select(., `Jan 2023`:`Dec 2023`), na.rm = TRUE)) %>%
  select(Location, `Jan 2023`:`Dec 2023`, Total_Sales) %>%
  mutate_if(is.numeric, round, 2)  # round all numeric columns to 2 decimal places

# Display the data using reactable for an interactive table
reactable(
  df_sales_2023_show,  
  highlight = TRUE,  # Highlight rows on hover
  defaultPageSize = 10,  # Display 10 rows per page
  paginationType = "numbers",  # Use numbers for page navigation
  searchable = TRUE,  # Make the table searchable
  sortable = TRUE,  # Allow sorting
  resizable = TRUE  # Allow column resizing
)
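
Because the code above recomputes `Total_Sales` while the file already carries a `Grand Total` column, it can be useful to confirm that the two agree. The following is a minimal sketch of such a check in plain Python, assuming each row is available as a dictionary keyed by column name; the function name `grand_total_mismatches`, the tolerance, and the two example stores are hypothetical.

```python
def grand_total_mismatches(rows, month_columns, total_column="Grand Total", tolerance=0.01):
    """Return the locations whose monthly figures do not add up to the stored total."""
    mismatches = []
    for row in rows:
        recomputed = sum(row[month] for month in month_columns)
        if abs(recomputed - row[total_column]) > tolerance:
            mismatches.append(row["Row Labels"])
    return mismatches


# Made-up rows with only two months, for illustration
example_rows = [
    {"Row Labels": "Store A", "Jan 2023": 10, "Feb 2023": 20, "Grand Total": 30},
    {"Row Labels": "Store B", "Jan 2023": 5, "Feb 2023": 7, "Grand Total": 15},
]
print(grand_total_mismatches(example_rows, ["Jan 2023", "Feb 2023"]))  # prints: ['Store B']
```

An empty result would suggest that either column can be used interchangeably in the later analysis.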

2.2.2 Dataset Sales 2022

2.2.2.1 Wrangling and Cleaning

Following the methodology established with the 2023 dataset, the 2022 sales data was similarly processed. The data from 2022, while structurally different, was also standardized to facilitate comparison and analysis. This included renaming columns to ensure uniformity in location names across both datasets.

Click to show code
# Load the data for 2022
file_path_2022 = '../data/sales_2022.xlsx'
df_2022 = pd.read_excel(file_path_2022)

# Standardize city names based on the provided mapping
city_name_mapping = {
    'Bâle': 'Basel',
    'Genève': 'Geneva',
    'Bienne': 'Biel/Bienne',
    'Chavannes': 'Chavannes-de-Bogis',
    'Marin': 'Marin-Epagnier',
    'Vesenaz': 'Vésenaz',
    'Yverdon': 'Yverdon-les-Bains',
    'Saint-Gall Webersbleiche': 'St. Gall'
}

# Rename columns to standardize city names
df_2022.rename(columns=city_name_mapping, inplace=True)

# Pivoting the table to get total sales per location for 2022, summing across all products
sales_columns_2022 = [col for col in df_2022.columns if col not in ['Code article', 'Description article', 'Marque', 'Code Fournisseur', 'Description Fournisseur']]
df_2022_total_sales = df_2022[sales_columns_2022].sum().reset_index()
df_2022_total_sales.columns = ['Location', 'Total Sales 2022']

2.2.2.2 Description

The 2022 dataset, unlike the 2023 dataset, includes a variety of products, each recorded with sales figures across different locations. This dataset is notably less complex, focusing on total sales rather than monthly breakdowns, yet provides critical insights into the sales performance of different products.

Click to show code
# Load the 2022 sales data
df_sales_2022 <- py$df_2022
# Load necessary libraries
library(tibble)
library(kableExtra)

# Create a tibble with variable descriptions for df_sales
variable_table <- tibble(
  Variable = c("Code article", "Description article", "Marque", "Code Fournisseur", "Description Fournisseur",
               "Location Columns (e.g., Ascona-Delta, Baden, Bâle, etc.)"),
  Description = c(
    "Unique identifier for each product.",
    "Descriptive name of the product.",
    "Brand of the product.",
    "Supplier code.",
    "Supplier name.",
    "Each of these columns represents sales figures for that specific location."
  )
)

# Display the table using kableExtra
variable_table %>%
  kbl() %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))
Variable                                                  Description
Code article                                              Unique identifier for each product.
Description article                                       Descriptive name of the product.
Marque                                                    Brand of the product.
Code Fournisseur                                          Supplier code.
Description Fournisseur                                   Supplier name.
Location Columns (e.g., Ascona-Delta, Baden, Bâle, etc.)  Each of these columns represents sales figures for that specific location.

Here is a closer look at the structured 2022 sales data:

Click to show code
library(dplyr)
library(reactable)

# Assuming the dataframe is correctly named df_sales_2022 and is already loaded
# Ensure the dataframe is available in your R environment
print(head(df_sales_2022))
#>   Code article                      Description article
#> 1        1e+08 CRÈME CAFÉ UHT15%MG LAIT FAIRSWISS10X12G
#> 2        1e+08        LAIT ÉQUITABLE ENTIER 3.5% UHT 1L
#> 3        1e+08         LAIT ÉQUITABLE DRINK 1.5% UHT 1L
#> 4        1e+08        LAIT ÉQUITABLE DRINK 1.5% PAST 1L
#> 5        1e+08       LAIT ÉQUITABLE ENTIER 3.5% PAST 1L
#>                Marque Code Fournisseur Description Fournisseur
#> 1 FAIRSWISS FAIRSWISS          2791100                CREMO SA
#> 2          Cremo (NA)          2791100                CREMO SA
#> 3             NA (NA)          2791100                CREMO SA
#> 4          Cremo (NA)          2791100                CREMO SA
#> 5          Cremo (NA)          2791100                CREMO SA
#>   Ascona-Delta Baden Basel Balerna Biel/Bienne Chavannes-de-Bogis
#> 1          661   224   333     148         280                689
#> 2          701   215   201     416        3025               7532
#> 3          498   134   313     183        3238               5547
#> 4          107   127    45      32          19                297
#> 5           97   130   110      48          49                294
#>   Chur Delémont Emmen Fribourg Geneva Lausanne Lugano Marin-Epagnier
#> 1  236      242   553      679   1088      732    719            337
#> 2  101     2405   290     8025   6509     6235    399          16843
#> 3   86     1427   243     5946   8174     5888    491          13022
#> 4   11       22    13       61    161      175     54             11
#> 5   47       35     8       81    136      156    108             15
#>   Monthey Morges Nyon Rapperswil St. Gall San Antonino Sargans
#> 1     636    453  221        172      204          107     220
#> 2   30013   4017 1895         83      261          415     351
#> 3   19175   3799 1653        112      327          338     215
#> 4      54     60   53          7       16           31      21
#> 5      87     95   41          3       40           44      77
#>   Sierre Sion Vésenaz Vevey Vezia Yverdon-les-Bains
#> 1    581  357     315   642   364               412
#> 2  17582 7809    2197 12646   487              5004
#> 3  13905 6901    1707 12172   407              5450
#> 4     52   82      59   129    65                29
#> 5     92   59      36   131   141                24

# Prepare the data by calculating the total sales per product across all locations
df_sales_2022_show <- df_sales_2022 %>%
  mutate(Total_Sales = rowSums(select(., `Ascona-Delta`:`Yverdon-les-Bains`), na.rm = TRUE)) %>%
  select(`Code article`, `Description article`, `Marque`, `Code Fournisseur`, `Description Fournisseur`, `Ascona-Delta`:`Yverdon-les-Bains`, Total_Sales) %>%
  mutate_if(is.numeric, round, 2)  # Round all numeric columns to 2 decimal places

# Display the data using reactable for an interactive and visually appealing table
reactable(
  df_sales_2022_show,
  highlight = TRUE,
  defaultPageSize = 5,
  paginationType = "numbers",
  searchable = TRUE,
  sortable = TRUE,
  resizable = TRUE
)

2.2.2.3 Merging 2022 and 2023 dataset

Click to show code
# Extracting the total sales for 2023 from the first dataset
df_2023_total_sales = df[['Row Labels', 'Grand Total']].rename(columns={'Row Labels': 'Location', 'Grand Total': 'Total Sales 2023'})

# Merging the 2022 and 2023 datasets on Location
merged_sales_data = pd.merge(df_2022_total_sales, df_2023_total_sales, on='Location', how='outer')

# Filling any NaN values that might have occurred due to locations present in one dataset and not the other
merged_sales_data.fillna(0, inplace=True)
Click to show code
# Load the merged sales data
df_merged_sales <- py$merged_sales_data
#show it using reactable
reactable(
  df_merged_sales,  
  highlight = TRUE,  # Highlight rows on hover
  defaultPageSize = 10,  # Display 10 rows per page
  paginationType = "numbers",  # Use numbers for page navigation
  searchable = TRUE,  # Make the table searchable
  sortable = TRUE,  # Allow sorting
  resizable = TRUE  # Allow column resizing
)

The 2022 sales data has been aggregated and standardized for each location. The merged dataset now shows the total sales per location for both 2022 and 2023. This dataset offers a comprehensive view of Lait Equitable’s sales dynamics over two consecutive years, highlighting trends and changes in consumer behavior across different locations.
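
One caveat of an outer merge followed by filling missing values with 0 is that a location spelled differently in the two files appears as two rows, each with a zero in one year. A quick comparison of the two sets of names before merging can reveal such cases. The sketch below is in plain Python; the function name `unmatched_locations` and the example names are illustrative assumptions.

```python
def unmatched_locations(locations_2022, locations_2023):
    """Return the names present in only one of the two years, as two sorted lists."""
    names_2022 = set(locations_2022)
    names_2023 = set(locations_2023)
    only_2022 = sorted(names_2022 - names_2023)
    only_2023 = sorted(names_2023 - names_2022)
    return only_2022, only_2023


# Illustrative input: the accent differs between the two spellings
only_2022, only_2023 = unmatched_locations(
    ["Basel", "Geneva", "Vesenaz"],
    ["Basel", "Geneva", "Vésenaz"],
)
print(only_2022, only_2023)  # prints: ['Vesenaz'] ['Vésenaz']
```

If both lists come back empty, every zero in the merged table reflects a location that is missing from one file, not a naming mismatch.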

2.2.3 Political Parties Dataset

2.2.3.1 Wrangling and Cleaning

The analysis starts by importing two datasets: sales data (annual sales of fair trade milk) and political party data (support percentages for major parties by location). The political data is cleaned to match commune names in the sales data and transformed into party presence percentages.

Next, the cleaned political data is merged with the sales data based on commune names. This merged dataset enables a combined analysis of political party presence and sales performance.

Click to show code
# Read sales data from Excel
sales_data <- read_excel("../data/Ventes annuelles.xlsx")

# Read political party data from Excel
party_data <- read_excel("../data/partisPolitiqueManor.xlsx")

# Clean up party_data to match sales_data locations
party_data_cleaned <- party_data %>%
  mutate(Location = gsub(" ", "", Location)) %>%
  filter(Location %in% sales_data$Location)

# Calculate party presence percentages for each location
party_data_cleaned <- party_data_cleaned %>%
  mutate(PLR_Presence = PLR / (PLR + PS + UDC + Centre + Verts) * 100,
         PS_Presence = PS / (PLR + PS + UDC + Centre + Verts) * 100,
         UDC_Presence = UDC / (PLR + PS + UDC + Centre + Verts) * 100,
         Centre_Presence = Centre / (PLR + PS + UDC + Centre + Verts) * 100,
         Verts_Presence = Verts / (PLR + PS + UDC + Centre + Verts) * 100)

# Merge sales_data with party presence data
merged_data <- merge(sales_data, party_data_cleaned, by = "Location")
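
The presence percentages divide each party's figure by the sum over the five listed parties, so the five shares add up to 100 for every location even if other parties exist. The sketch below restates that normalization in plain Python with the denominator computed once; the function name `party_presence` and the example figures are hypothetical.

```python
PARTIES = ("PLR", "PS", "UDC", "Centre", "Verts")


def party_presence(support: dict) -> dict:
    """Return each party's share of the five-party total, in percent."""
    total = sum(support[party] for party in PARTIES)
    if total == 0:
        raise ValueError("the five-party total must be non-zero")
    return {f"{party}_Presence": 100 * support[party] / total for party in PARTIES}


# Made-up support figures for one location (assumption, not real data)
shares = party_presence({"PLR": 16, "PS": 24, "UDC": 20, "Centre": 12, "Verts": 8})
print(shares["PS_Presence"])  # prints: 30.0
```

Because the shares are relative to these five parties only, they should be read as relative strength among them and not as overall vote shares.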

The analysis then imports one additional dataset: income per taxpayer per commune (the `Revenu/contribuable` column). The income data is cleaned to match commune names in the sales data.

Next, the cleaned income data is merged with the sales data based on commune names. This merged dataset enables a combined analysis of income per taxpayer per commune and sales performance.

Click to show code
# Load the datasets
revenu_df <- read_excel("../data/revenuParContribuable_CommuneManor.xlsx")
ventes_df <- read_excel("../data/Ventes annuelles.xlsx")

# Merge the datasets on the "Location" column
merged_df <- inner_join(revenu_df, ventes_df, by = "Location")

# Clean the data and convert to numeric format
merged_df$`Revenu/contribuable` <- as.numeric(gsub(" ", "", merged_df$`Revenu/contribuable`))
merged_df$`2022` <- as.numeric(gsub(" ", "", merged_df$`2022`))
merged_df$`2023` <- as.numeric(gsub(" ", "", merged_df$`2023`))
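
The conversion above strips spaces used as thousands separators before casting to numeric. A minimal Python sketch of the same idea follows; the function name `parse_spaced_number` and the sample strings are hypothetical. Unlike `gsub(" ", "", x)`, which removes only the ordinary space character, this variant removes every whitespace character, including the non-breaking spaces that spreadsheet exports sometimes contain.

```python
def parse_spaced_number(text: str) -> float:
    """Convert a string such as '78 450' to 78450.0 by dropping all whitespace."""
    # str.split() with no argument splits on any Unicode whitespace
    cleaned = "".join(text.split())
    return float(cleaned)


print(parse_spaced_number("78 450"))  # prints: 78450.0
```

Values that still fail to parse after this step would raise a `ValueError`, which makes unexpected formats visible instead of silently turning them into missing values.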

2.2.3.2 Description

Write kableextra to describe the dataset here

Click to show code
# Display the merged data using reactable
reactable(
  merged_data,  
  highlight = TRUE,  # Highlight rows on hover
  defaultPageSize = 10,  # Display 10 rows per page
  paginationType = "numbers",  # Use numbers for page navigation
  searchable = TRUE,  # Make the table searchable
  sortable = TRUE,  # Allow sorting
  resizable = TRUE  # Allow column resizing
)

TODO (JAYESH): USE CODE TO DISPLAY THE DATA

3 Exploratory data analysis

EMELINE EDA

3.1 Lait Equitable Products

To provide a comprehensive understanding of Lait Equitable’s sales trends throughout 2023, we performed a month-by-month sales analysis. This exploration helps identify seasonal effects, peak sales periods, and potential areas for strategic adjustments. Here’s a detailed breakdown of the approach and findings:

3.1.1 Sales Distribution Across Months

Click to show code
# Remove 'Grand Total' column and the row labels column
monthly_sales <- df_sales_2023 %>% 
  select(-c(`Grand Total`, `Row Labels`))

# Aggregate the sales per month across all locations
total_sales_per_month <- colSums(monthly_sales)

# Create a data frame for plotting
monthly_sales_df <- data.frame(Month = names(total_sales_per_month), Sales = total_sales_per_month)

# Sort the data frame by Sales in descending order
sorted_monthly_sales_df <- monthly_sales_df %>%
  arrange(desc(Sales))

# Plot with ggplot2: bars ordered from highest to lowest sales, drawn in a single fixed color
ggplot(sorted_monthly_sales_df, aes(x = reorder(Month, -Sales), y = Sales)) +
  geom_bar(stat = "identity", fill = "#24918d", color = "black") +
  labs(title = "Total Sales by Month Across All Locations (2023)", x = "Month", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The graph shows total monthly sales of Lait Equitable products across all locations in 2023, ordered from the highest-selling month (March) to the lowest (July).

  • Highest Sales in March: The graph starts with March, which shows the highest sales, almost reaching 25,000 units. This suggests that March was a particularly strong month, possibly due to seasonal factors or specific marketing campaigns.
  • Gradual Decline in Sales: Because the bars are sorted by sales volume, they decline from left to right. After March, the next highest sales are in December, followed by April, May, and so on. The level reached in March was not sustained throughout the year.
  • Mid-year and End-Year Trends: Although the graph is not in chronological order, it shows that December (typically strong due to the holiday season) also performed well, but no other month reached the peak seen in March.
  • Lowest Sales in June and July: The months at the right end of the graph, June and July, show the lowest sales figures of the year. This could indicate a seasonal dip or other market dynamics affecting these months. One hypothesis is that many customers are away during the school holidays.
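
The ranking behind this chart is a column-wise sum followed by a descending sort. The sketch below shows the same two steps in plain Python on made-up numbers; the function name `rank_months_by_sales` and the example rows are hypothetical, and each row is assumed to be a dictionary keyed by month label.

```python
def rank_months_by_sales(rows, month_columns):
    """Sum each month across all locations and return (month, total) pairs, highest first."""
    totals = {month: sum(row[month] for row in rows) for month in month_columns}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Two illustrative stores with three months each (not real data)
example_rows = [
    {"Jan 2023": 10, "Feb 2023": 30, "Mar 2023": 20},
    {"Jan 2023": 5, "Feb 2023": 1, "Mar 2023": 40},
]
ranking = rank_months_by_sales(example_rows, ["Jan 2023", "Feb 2023", "Mar 2023"])
print(ranking)  # prints: [('Mar 2023', 60), ('Feb 2023', 31), ('Jan 2023', 15)]
```

Keeping the original `month_columns` order at hand also makes it easy to draw a chronological version of the chart, which is better suited to reading seasonality.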

3.1.2 Sales Distribution Across Locations

Click to show code
# First, we need to remove the 'Grand Total' column if it's included
df <- df_sales_2023[, -ncol(df_sales_2023)]
# Sum sales across all months for each location
total_sales_by_location <- df %>%
  mutate(Total_Sales = rowSums(select(., -`Row Labels`))) %>%
  select(`Row Labels`, Total_Sales)

# Sort the locations by total sales in descending order
sorted_sales_by_location <- total_sales_by_location %>%
  arrange(desc(Total_Sales))

# Plotting the data with ggplot2
ggplot(sorted_sales_by_location, aes(x=reorder(`Row Labels`, Total_Sales), y=Total_Sales, fill=`Row Labels`)) +
  geom_bar(stat="identity", show.legend = FALSE,  fill = "#24918d", color = "black") + 
  labs(title = "Total Sales by Location (2023)", x = "Location", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=0.5)) # Rotate the x-axis text for better readability

The graph illustrates the total sales by location for Lait Equitable across various stores in 2023, organized from the lowest to the highest sales volume.

  • Variability in Sales Across Locations: The graph displays a significant variation in sales across different locations. The left side of the graph shows locations with the least sales, starting with Chur, Rapperswil, St. Gall, and progressively increasing towards the right.
  • Low Sales in Certain Areas: Locations like Chur, Rapperswil, and St. Gall have notably low sales, which could indicate either a lower demand for Lait Equitable’s products in these areas or possibly less effective marketing and distribution strategies.
  • High Sales in Specific Locations: The right end of the graph, particularly the last five locations, shows a sharp increase in sales. Notably, Vevey, Marin-Epagnier, Sierre and Monthey exhibit high sales, with Monthey being the highest. This might indicate a stronger market presence, better consumer acceptance, or more effective promotional activities in these regions.
  • Potential Market Strengths and Weaknesses: The graph effectively highlights where Lait Equitable is performing well and where there might be room for improvement. For instance, the high sales in cities like Sierre and Monthey suggest strong market penetration and acceptance.
  • Strategic Insights: For Lait Equitable, this graph provides crucial data points for understanding which locations might need more focused marketing efforts or adjustments in distribution strategies. It could also help identify successful strategies in high-performing locations that could be replicated in areas with lower sales.
Click to show code
# Remove the 'Grand Total' column (assumed to be the last column)
df <- df_sales_2023[, -ncol(df_sales_2023)]
# Transform the data into a long format where each row contains a location, a month, and sales
long_data <- df %>%
  pivot_longer(cols = -`Row Labels`, names_to = "Month", values_to = "Sales") %>%
  mutate(Location = `Row Labels`)

# Create a plotly object for an interactive boxplot.
# boxpoints, jitter and line are trace-level attributes of a box trace,
# and width/height are passed to plot_ly() rather than layout().
fig <- plot_ly(long_data, x = ~Location, y = ~Sales, type = 'box',
               hoverinfo = 'text', text = ~paste('Month:', Month, '<br>Sales:', Sales),
               boxpoints = "all",
               jitter = 0.3,
               marker = list(color = "#7e57c2"),
               line = list(color = "#24918d"),
               width = 600,
               height = 800) %>%
  layout(title = "Distribution of Monthly Sales Across Locations",
         xaxis = list(title = "Location"),
         yaxis = list(title = "Monthly Sales"),
         showlegend = FALSE) %>%
  config(displayModeBar = FALSE) # Optional: hide the mode bar

# Display the plot
fig

This graph shows the distribution of monthly sales at each location in 2023. The boxplot highlights the range, median, and outliers for each location.

The outliers correspond to the high- and low-sales months identified earlier. This confirms the previous analysis and gives a more detailed view of the sales distribution across locations.
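
Boxplots conventionally flag as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to one location's monthly sales in plain Python. The function name `tukey_outliers` and the sample values are hypothetical, and the quartile method used by `statistics.quantiles` may differ slightly from the one plotly uses, so borderline points can be classified differently.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Twelve made-up monthly figures with one unusually strong month
monthly_sales = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 40]
print(tukey_outliers(monthly_sales))  # prints: [40]
```

Listing the flagged months per location is a simple way to check whether the same months stand out everywhere or only in a few stores.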

3.1.3 Top Performing / Worst Performing Locations

Click to show code
#using grand total to sort the data from top to bottom
df <- df_sales_2023
#using 'grand total' column as total sales plot the top and bottom locations
df %>%
  arrange(desc(`Grand Total`)) %>%
  slice_head(n = 5) %>%
  select(`Row Labels`, `Grand Total`) %>%
  ggplot(aes(x = reorder(`Row Labels`, `Grand Total`), y = `Grand Total`, fill = `Row Labels`)) +
  geom_bar(stat = "identity", show.legend = FALSE, fill = "#33848D", color = "black") +
  labs(title = "Top 5 Performing Locations by Total Sales (2023)", x = "Location", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=0.5)) # Rotate the x-axis text for better readability

# Worst-performing locations
df %>%
  arrange(`Grand Total`) %>%
  slice_head(n = 5) %>%
  select(`Row Labels`, `Grand Total`) %>%
  ggplot(aes(x = reorder(`Row Labels`, `Grand Total`), y = `Grand Total`, fill = `Row Labels`)) +
  geom_bar(stat = "identity", show.legend = FALSE, fill = "#33848D", color = "black") +
  labs(title = "Bottom 5 Performing Locations by Total Sales (2023)", x = "Location", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=0.5)) # Rotate the x-axis text for better readability

Consistent with the previous analysis, the bar charts display the top- and bottom-performing locations. The first graph shows the five locations with the highest total sales, and the second shows the five locations with the lowest total sales.

Top-performing locations are: Monthey, Sierre, Marin-Epagnier, Vevey, and Sion. Worst-performing locations are: Basel, St. Gall, Sargans, Rapperswil, and Chur.

3.1.4 2022 vs 2023

Click to show code
# Reshape to long format (one row per location and year), then plot the two years
# side by side as semi-transparent dodged bars
df_merged_sales %>%
  pivot_longer(cols = c(`Total Sales 2022`, `Total Sales 2023`),
               names_to = "Year", names_prefix = "Total Sales ", values_to = "Sales") %>%
  ggplot(aes(x = reorder(Location, -Sales), y = Sales, fill = Year)) +
  geom_col(position = "dodge", color = "black", alpha = 0.7) +
  scale_fill_manual(values = c("2022" = "#7e57c2", "2023" = "#33848D")) +
  labs(title = "Total Sales Comparison Between 2022 and 2023 by Location", x = "Location", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=0.5)) # Rotate the x-axis text for better readability

  • Trend of Decline: A significant number of Manor locations have lower sales figures in 2023 compared to 2022. This trend suggests that Lait Equitable might be facing challenges in these areas, which could include increased competition, changing consumer preferences, or other market dynamics affecting the demand for their products.

  • Monthey’s Decline: The bar chart shows that Monthey, the top-selling location, experienced a substantial decrease in sales in 2023 compared to 2022. This is a key point of concern for Lait Equitable, and understanding why sales fell there is essential. Possible factors include local economic conditions, operational challenges, increased competition, or changes in consumer preferences in that area.
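
The decline discussed above can be quantified per location as a year-over-year percentage change. The sketch below is in plain Python; the function name `yoy_change_percent` and the sample figures are hypothetical. It returns `None` when the 2022 value is zero, which can happen for locations that were filled with 0 during the merge, since a percentage change from zero is undefined.

```python
from typing import Optional


def yoy_change_percent(sales_2022: float, sales_2023: float) -> Optional[float]:
    """Return the percentage change from 2022 to 2023, or None if 2022 sales are zero."""
    if sales_2022 == 0:
        return None
    return round(100 * (sales_2023 - sales_2022) / sales_2022, 2)


# Made-up totals for one location
print(yoy_change_percent(200, 150))  # prints: -25.0
```

Sorting locations by this value would separate stores with a large relative drop from stores where the absolute drop is large only because their volume is high.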

3.1.5 Map

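import time

import folium
import pandas as pd
import branca.colormap as cmp
from folium.plugins import HeatMap
from geopy.geocoders import Nominatim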
locations = [
      "46.5907889,6.7386114","46.5842792,6.6649234","46.5149217,6.888093","46.5366608,6.7054361","46.2962374,6.9946778","46.5575317,6.7491415","46.5407979,6.7757944","46.5732277,6.7334704","46.6179705,6.6729034","46.4438248,6.9216581","46.5244929,6.7652973","46.5535214,6.7766663","46.9087674,7.0772625","46.5072978,6.7303043","46.7658972,6.5635194","46.4890932,6.839986","46.5343528,6.7394634","46.6558216,6.7369549","46.5156779,6.8621498","46.6112911,6.8073595","46.6666946,6.8991534","46.7395322,6.9715464","46.6468967,6.9540885","46.5153462,6.8584537","46.5975359,7.0382028","46.595883,6.8855935","46.6612814,6.8782794","46.7087666,7.1239746","46.6510812,6.9711981","46.6141324,7.0318659","46.601508,7.1001934","46.6091436,7.0307497","46.5766179,6.8634981","46.5954987,6.9855279","46.829064,6.9264143","46.6588489,6.8758917","46.6124961,6.9757718","47.2915521,7.3217577","47.2775379,7.3830611","47.1202632,6.8834564","47.2786229,7.3611541","47.270679,7.39172","47.3439771,7.4949095","47.3276332,7.218765","47.3660465,7.4060015","47.3649491,7.46977","47.2032746,6.9890961","47.2092332,7.0068912","46.3260628,6.9042257","46.2126721,6.9985419","46.1163846,7.112862","46.3649142,6.8995776","46.9232972,6.4673148","47.0769951,6.756922","46.0377232,8.8837298","46.1646151,8.9659276","46.1292513,8.985035","46.0029796,8.8540675","46.4694663,8.9413945","46.1460954,8.9345517","47.3859362,7.4313549","47.4475168,7.5593599","47.3809253,7.4221544","47.4364108,9.3341963","47.4304657,9.3321608","47.4341264,9.3481821","47.4345287,9.3262813","47.426417,9.3338866","47.4652773,9.1518866","47.2749047,9.5092842","47.1135124,9.1297439","47.404214,9.3419733","47.5295845,8.4670763","47.1738126,8.3342979","47.08052,8.2031428","47.2351478,8.1477158","46.7883954,7.4004079","47.0364909,7.2786631","46.8364949,7.7665098","46.7833635,7.3912478"]

latitudes = []
longitudes = []

# Parse each location and extract latitude and longitude
for location in locations:
    lat, lon = location.split(',')
    latitudes.append(float(lat))
    longitudes.append(float(lon))

# Create a DataFrame using pandas
data = pd.DataFrame({
    'Latitude': latitudes,
    'Longitude': longitudes
})

data
#>      Latitude  Longitude
#> 0   46.590789   6.738611
#> 1   46.584279   6.664923
#> 2   46.514922   6.888093
#> 3   46.536661   6.705436
#> 4   46.296237   6.994678
#> ..        ...        ...
#> 75  47.235148   8.147716
#> 76  46.788395   7.400408
#> 77  47.036491   7.278663
#> 78  46.836495   7.766510
#> 79  46.783364   7.391248
#> 
#> [80 rows x 2 columns]
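# Scale a sales volume linearly to a marker radius between 3 and max_radius + 3
def calculate_radius(volume, max_volume, min_volume, max_radius=20):
    if max_volume == min_volume:
        return 3  # Avoid division by zero when all volumes are equal
    normalized_volume = (volume - min_volume) / (max_volume - min_volume)
    return normalized_volume * max_radius + 3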

# Function to get latitude and longitude
def get_lat_lon(city):
    try:
        time.sleep(1)  # Simple rate-limiting mechanism
        location = geolocator.geocode(city + ', Switzerland')
        return location.latitude, location.longitude
    except AttributeError:
        return None, None

# Read data from different product categories
file_paths = {
    'All Products': ("../data/Produits laitiers équitables - 2023.xlsb", 'Par SM'),
    'Milk Drink': ("../data/lait_drink_sales_per_stores_2023.xlsx", 'Sheet1'),
    'Milk Entier': ("../data/lait_entier_sales_per_stores_2023.xlsx", 'Sheet1'),
    'Fondue': ("../data/fondue_sales_per_stores_2023.xlsx", 'Sheet1'),
    'Delice': ("../data/delice_sales_per_stores_2023.xlsx", 'Sheet1'),
    'Creme': ("../data/creme_cafe_sales_per_stores_2023.xlsx", 'Sheet1')
}

# Create a folium map
m = folium.Map(location=[46.8182, 8.2275], zoom_start=8)
# Instantiate the geolocator
geolocator = Nominatim(user_agent="le_stores")

# Loop through each category
for category, (file_path, sheet_name) in file_paths.items():
    engine = 'pyxlsb' if 'xlsb' in file_path else None
    df = pd.read_excel(file_path, engine=engine, sheet_name=sheet_name)

    if category == 'All Products':
        # Skip the first six rows and rename columns based on the provided structure
        df = df.iloc[6:]  
        df.rename(columns={
            'Quantités vendues - année 2023': 'City',
            'Unnamed: 1': '01/01/2023',
            'Unnamed: 2': '02/01/2023',
            'Unnamed: 3': '03/01/2023',
            'Unnamed: 4': '04/01/2023',
            'Unnamed: 5': '05/01/2023',
            'Unnamed: 6': '06/01/2023',
            'Unnamed: 7': '07/01/2023',
            'Unnamed: 8': '08/01/2023',
            'Unnamed: 9': '09/01/2023',
            'Unnamed: 10': '10/01/2023',
            'Unnamed: 11': '11/01/2023',
            'Unnamed: 12': '12/01/2023',
            'Unnamed: 13': 'Total General'
        }, inplace=True)
    else:
        # For the XLSX files, the first column holds the city and the last column the total
        df.rename(columns={
            df.columns[0]: 'City',
            df.columns[-1]: 'Total General'
        }, inplace=True)

    # Standardize city names
    correct_city_names = {
        'Bâle': 'Basel',
        'Genève': 'Geneva',
        'Bienne': 'Biel/Bienne',
        'Chavannes': 'Chavannes-de-Bogis',
        'Marin': 'Marin-Epagnier',
        'Vesenaz': 'Vésenaz',
        'Yverdon': 'Yverdon-les-Bains',
        'Saint-Gall Webersbleiche': 'St. Gall'
    }
    df['City'] = df['City'].apply(lambda x: correct_city_names.get(x, x))

    # Get latitudes and longitudes
    df[['Lat', 'Lon']] = df.apply(lambda row: pd.Series(get_lat_lon(row['City'])), axis=1)

    # Define color scale and feature group
    max_sales = df['Total General'].max()
    min_sales = df['Total General'].min()
    color_scale = cmp.linear.viridis.scale(min_sales, max_sales)
    fg = folium.FeatureGroup(name=category)
    
    # Add markers
    for index, row in df.iterrows():
        if pd.notnull(row['Lat']) and pd.notnull(row['Lon']):
            radius = calculate_radius(row['Total General'], max_sales, min_sales)
            folium.CircleMarker(
                location=[row['Lat'], row['Lon']],
                radius=radius,
                popup=f"{row['City']}: {row['Total General']}",
                color=color_scale(row['Total General']),
                fill=True,
                fill_color=color_scale(row['Total General'])
            ).add_to(fg)

    fg.add_to(m)

heat_data = [[float(lat), float(lon)] for loc in locations for lat, lon in [loc.split(',')]]

# Add HeatMap layer
HeatMap(heat_data).add_to(m)

# Add layer control and save the map
folium.LayerControl().add_to(m)
m.save('combined_product_map.html')
m
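
The loop above geocodes every city again for each product category, pausing one second per request. The following is a minimal sketch of caching those lookups, assuming the same city names recur across categories. `slow_lookup` and `FAKE_COORDINATES` are hypothetical stand-ins for the Nominatim call so that the example runs offline, and the coordinates are rounded, illustrative values.

```python
from functools import lru_cache

# Hypothetical stand-in data for the geocoding service (rounded, illustrative values)
FAKE_COORDINATES = {"Sion": (46.23, 7.36), "Vevey": (46.46, 6.84)}
lookup_count = 0


def slow_lookup(city):
    """Pretend to query a geocoding service and count how often it is called."""
    global lookup_count
    lookup_count += 1
    return FAKE_COORDINATES.get(city, (None, None))


@lru_cache(maxsize=None)
def cached_lat_lon(city):
    """Return (lat, lon) for a city, querying the service only once per name."""
    return slow_lookup(city)


# The same cities appear in every product category, so repeated calls hit the cache
for category in ["All Products", "Milk Drink", "Fondue"]:
    for city in ["Sion", "Vevey", "Unknown"]:
        cached_lat_lon(city)

print(lookup_count)  # one service call per distinct city name
```

Note that this approach also caches failed lookups, which is convenient for rate limiting but means a transient service error would persist for the rest of the session.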

3.2 Price To Producers Lait Cru

3.2.1 Organic (bio) Milk vs Non-Organic Milk

# Create xts object
prices_xts <- xts(df_producteur[, c("prix_bio", "prix_non_bio")], order.by = df_producteur$date)

# Plot using dygraphs
dygraph(prices_xts, main = "Trends in Milk Prices (Organic vs. Non-Organic)", width = "600px", height = "400px") %>%
  dySeries("prix_bio", label = "Organic Price", color = "#24918d") %>%
  dySeries("prix_non_bio", label = "Non-Organic Price", color = "#7e57c2") %>%
  dyOptions(stackedGraph = FALSE) %>%
  dyRangeSelector(height = 20)


# Create an xts object for the delta series, ensuring the series name is retained
delta_xts <- xts(x = df_producteur[,"delta", drop = FALSE], order.by = df_producteur$date)

# Plot using dygraphs
p_delta <- dygraph(delta_xts, main = "Difference in Prices Between Organic and Non-Organic Milk Over Time", width = "600px", height = "400px") %>%
  dySeries("delta", label = "Delta in Price", color = "#24918d") %>%
  dyOptions(stackedGraph = FALSE) %>%
  dyRangeSelector(height = 20)

# Print the dygraph to display it
p_delta

3.2.2 Seasonality

# Process the data to extract month and year
df_producteur <- df_producteur %>%
  mutate(Month = format(date, "%m"),
         Year = format(date, "%Y")) %>%
  arrange(date) # Ensure data is in chronological order

# Plotting the data with ggplot2, showing the trend within each year
p_seaso_2 <- ggplot(df_producteur, aes(x = Month, y = prix_bio, group = Year, color = as.factor(Year))) +
  geom_smooth(se = FALSE, method = "loess", span = 0.3, linewidth = 0.7) +
  labs(title = "Monthly Milk Prices by Year",
       x = "Month",
       y = "Price of Organic Milk",
       color = "Year") +
  theme_minimal() +
  scale_color_viridis_d() +
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, hjust = 1))

# Convert to an interactive plotly object
interactive_plot_seaso_2 <- ggplotly(p_seaso_2, width = 600, height = 400)

# Adjust plotly settings 
interactive_plot_seaso_2 <- interactive_plot_seaso_2 %>%
  layout(margin = list(l = 40, r = 10, b = 40, t = 40), # Adjust margins
         legend = list(orientation = "h", x = 0, xanchor = "left", y = -0.2)) # Adjust legend position

# Display the interactive plot
interactive_plot_seaso_2
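
One simple way to summarise the seasonal pattern that this plot is meant to reveal is to average each calendar month across years. The following is a minimal sketch under that assumption; the `observations` values and the `monthly_means` helper are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def monthly_means(observations):
    """Average the price for each calendar month across all years."""
    by_month = defaultdict(list)
    for (year, month), price in observations.items():
        by_month[month].append(price)
    return {month: sum(values) / len(values) for month, values in sorted(by_month.items())}


# Hypothetical (year, month) -> price observations (illustrative values)
observations = {
    (2021, 1): 80.0, (2021, 7): 76.0,
    (2022, 1): 82.0, (2022, 7): 78.0,
}
means = monthly_means(observations)
print(means)
```

If some months are consistently above or below the overall mean, that points to a seasonal component worth modelling explicitly.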

4 Analysis

  • Answers to the research questions
  • Different methods considered
  • Competing approaches
  • Justifications

4.1 Forecasting Next Year Milk Prices

# Re-arrange the df_producteur data in ascending date order
df_producteur <- df_producteur[order(df_producteur$date),]

# Create ts objects for organic and non-organic milk prices
df_producteur_ts_non_bio <- ts(df_producteur$prix_non_bio, start=c(2017, 12), frequency=12)
df_producteur_ts_bio <- ts(df_producteur$prix_bio, start=c(2017, 12), frequency=12)

# Convert the ts objects to tsibble objects
df_producteur_ts_non_bio <- as_tsibble(df_producteur_ts_non_bio)
df_producteur_ts_bio <- as_tsibble(df_producteur_ts_bio)

4.1.1 Naive Forecast

# Fit a naive model
fit_non_bio <- df_producteur_ts_non_bio %>% model(naive = NAIVE(value))
fit_bio <- df_producteur_ts_bio %>% model(naive = NAIVE(value))

# Forecast the next 12 months
naive_forecast_non_bio <- fit_non_bio %>% forecast(h = 12)
naive_forecast_bio <- fit_bio %>% forecast(h = 12)

plot <- naive_forecast_non_bio %>%
  autoplot(df_producteur_ts_non_bio, alpha = 0.5) +
  labs(title = "Naive Forecast of Non-Organic Milk Prices",
       x = "Date",
       y = "Price") + guides(colour = guide_legend(title = "Forecast"))
plot
plot <- naive_forecast_bio %>%
  autoplot(df_producteur_ts_bio, alpha = 0.5) +
  labs(title = "Naive Forecast of Organic Milk Prices",
       x = "Date",
       y = "Price") + guides(colour = guide_legend(title = "Forecast"))
plot

The naive model is deliberately simple: it assumes that every future value equals the last observed value, so its forecasts are flat. It nevertheless provides a baseline against which to compare the other models.
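
The idea behind the naive forecast can be sketched in a few lines. This is an illustration of the principle rather than the `NAIVE()` implementation used above; the `history` and `future` values and both helper functions are hypothetical.

```python
def naive_forecast(series, horizon):
    """Repeat the last observed value for every step of the forecast horizon."""
    if not series:
        raise ValueError("series must contain at least one observation")
    return [series[-1]] * horizon


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical monthly prices (illustrative values, not from the dataset)
history = [66.0, 67.5, 68.0, 69.0]
future = [69.5, 70.0, 68.5]

forecast = naive_forecast(history, horizon=3)
mae = mean_absolute_error(future, forecast)
print(forecast)
print(round(mae, 3))
```

Any candidate model should achieve a lower error than this baseline on held-out data to justify its added complexity.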

4.1.2 ARIMA Model

4.1.2.1 Stationarity

# Re-arrange the df_producteur data in ascending date order
df_producteur <- df_producteur[order(df_producteur$date),]

# Create ts objects for organic and non-organic milk prices
df_producteur_ts_non_bio <- ts(df_producteur$prix_non_bio, start=c(2017, 12), frequency=12)
df_producteur_ts_bio <- ts(df_producteur$prix_bio, start=c(2017, 12), frequency=12)
# Check for stationarity
adf.test(df_producteur_ts_non_bio)
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  df_producteur_ts_non_bio
#> Dickey-Fuller = -3, Lag order = 4, p-value = 0.08
#> alternative hypothesis: stationary
adf.test(df_producteur_ts_bio)
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  df_producteur_ts_bio
#> Dickey-Fuller = -3, Lag order = 4, p-value = 0.08
#> alternative hypothesis: stationary

We assess the stationarity of both the organic and non-organic milk price series using the Augmented Dickey-Fuller (ADF) test, which checks whether a unit root is present in a time series. The null hypothesis is that the series has a unit root and is therefore non-stationary, meaning that statistical properties such as its mean and variance change over time. Rejecting the null hypothesis indicates that the series is stationary.

We run this test because ARIMA models require the time series to be stationary.

The results of the ADF test for both time series df_producteur_ts_non_bio and df_producteur_ts_bio indicate:

  • Dickey-Fuller statistic value of -3
  • Lag order of 4
  • p-value of 0.08

With a p-value of 0.08, we cannot reject the null hypothesis of a unit root at the 5% significance level (although we could at the 10% level), so the test does not provide sufficient evidence that either series is stationary.

We therefore difference the data to make it stationary.
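
First-order differencing replaces each observation with its change from the previous one, which removes a steady trend in the level. The following is a minimal sketch of the operation; the `prices` values and the `difference` helper are hypothetical and only illustrate what R's `diff()` computes by default.

```python
def difference(series, lag=1):
    """Return the lag-differenced series y[t] - y[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical price series with a steady upward trend (illustrative values)
prices = [60.0, 61.0, 62.5, 63.0, 64.5, 65.0]
diffed = difference(prices)
print(diffed)
```

The differenced series is one observation shorter than the original and fluctuates around a constant level instead of trending upward.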

#difference the time series
df_producteur_ts_non_bio_diff <- diff(df_producteur_ts_non_bio)
df_producteur_ts_bio_diff <- diff(df_producteur_ts_bio)

# Plot the differenced series
autoplot(df_producteur_ts_non_bio_diff) + labs(title = "Differenced Time Series of Non-Organic Milk Prices")
autoplot(df_producteur_ts_bio_diff) + labs(title = "Differenced Time Series of Organic Milk Prices")

#check for stationarity
adf.test(df_producteur_ts_non_bio_diff)
#> Warning in adf.test(df_producteur_ts_non_bio_diff): p-value smaller
#> than printed p-value
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  df_producteur_ts_non_bio_diff
#> Dickey-Fuller = -6, Lag order = 4, p-value = 0.01
#> alternative hypothesis: stationary
adf.test(df_producteur_ts_bio_diff)
#> Warning in adf.test(df_producteur_ts_bio_diff): p-value smaller than
#> printed p-value
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  df_producteur_ts_bio_diff
#> Dickey-Fuller = -6, Lag order = 4, p-value = 0.01
#> alternative hypothesis: stationary

The results of the ADF test for both differenced time series indicate:

  • Dickey-Fuller statistic value of -6
  • Lag order of 4
  • p-value of 0.01

Here the p-value is below the usual 0.05 significance level (the warning indicates that it is in fact smaller than the printed 0.01), so we reject the null hypothesis of a unit root and conclude that the differenced series are stationary.

After one round of differencing, both df_producteur_ts_non_bio and df_producteur_ts_bio are therefore stationary, which is what ARIMA modelling requires.
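
A model fitted to a differenced series predicts changes rather than levels, so its forecasts have to be cumulated from the last observed value to return to the price scale. An ARIMA model with one order of differencing handles this step internally; the sketch below only illustrates the principle, and `integrate_forecast` and its inputs are hypothetical.

```python
from itertools import accumulate


def integrate_forecast(last_level, diff_forecasts):
    """Turn forecasts of first differences back into forecasts of levels."""
    # accumulate() starts from last_level; drop it so only future levels remain
    return list(accumulate(diff_forecasts, initial=last_level))[1:]


# Hypothetical last observed price and forecast changes (illustrative values)
last_price = 65.0
predicted_changes = [0.5, 0.25, 0.25]

levels = integrate_forecast(last_price, predicted_changes)
print(levels)
```

Because each forecast level builds on the previous one, errors in the predicted changes add up over the horizon, which is one reason forecast intervals widen for integrated models.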

4.1.2.2 Fitting the ARIMA Model and Forecasting

Click to show code
# Fit the ARIMA model
fit_non_bio <- auto.arima(df_producteur_ts_non_bio, seasonal = FALSE)
fit_bio <- auto.arima(df_producteur_ts_bio, seasonal = FALSE)

# Forecast the next 12 months
forecast_non_bio <- forecast(fit_non_bio, h = 12)
forecast_bio <- forecast(fit_bio, h = 12)

#show the components used for the ARIMA model
fit_non_bio %>% summary()
#> Series: df_producteur_ts_non_bio 
#> ARIMA(2,1,1) with drift 
#> 
#> Coefficients:
#>         ar1     ar2     ma1  drift
#>       1.413  -0.636  -0.908  0.193
#> s.e.  0.095   0.085   0.083  0.059
#> 
#> sigma^2 = 1.11:  log likelihood = -110
#> AIC=231   AICc=232   BIC=242
#> 
#> Training set error measures:
#>                   ME RMSE   MAE    MPE MAPE  MASE   ACF1
#> Training set -0.0319 1.02 0.789 -0.078 1.16 0.283 -0.118
fit_bio %>% summary()
#> Series: df_producteur_ts_bio 
#> ARIMA(0,1,1) 
#> 
#> Coefficients:
#>         ma1
#>       0.616
#> s.e.  0.088
#> 
#> sigma^2 = 6.39:  log likelihood = -178
#> AIC=360   AICc=360   BIC=365
#> 
#> Training set error measures:
#>                  ME RMSE  MAE    MPE MAPE  MASE     ACF1
#> Training set 0.0694  2.5 1.86 0.0626 2.18 0.677 -0.00305

#plot the forecasted values
autoplot(forecast_non_bio) + labs(title = "Forecasted Prices of Non-Organic Milk")

Click to show code
autoplot(forecast_bio) + labs(title = "Forecasted Prices of Organic Milk")

4.1.2.3 Fit a SARIMA Model

Click to show code
# Fit the SARIMA model
fit_non_bio_sarima <- auto.arima(df_producteur_ts_non_bio, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)
fit_bio_sarima <- auto.arima(df_producteur_ts_bio, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)

# Forecast the next 12 months
forecast_non_bio_sarima <- forecast(fit_non_bio_sarima, h = 12)
forecast_bio_sarima <- forecast(fit_bio_sarima, h = 12)

#plot the forecasted values
autoplot(forecast_non_bio_sarima) + labs(title = "Forecasted Prices of Non-Organic Milk (SARIMA)")
autoplot(forecast_bio_sarima) + labs(title = "Forecasted Prices of Organic Milk (SARIMA)")

4.1.2.4 Compare ARIMA and SARIMA forecast

4.1.2.4.1 Organic Milk SARIMA vs ARIMA

Click to show code
# compare the organic ARIMA and SARIMA fits using AIC (lower is better)
AIC(fit_bio, fit_bio_sarima)

4.1.2.4.2 Non-Organic Milk SARIMA vs ARIMA

Click to show code
# compare the non-organic ARIMA and SARIMA fits using AIC (lower is better)
AIC(fit_non_bio, fit_non_bio_sarima)
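
As a reminder of what the AIC measures, the sketch below computes AIC = 2k - 2 ln(L) in plain Python from the rounded log-likelihoods printed in the ARIMA summaries above. The parameter count k is an assumption here: it counts each estimated coefficient plus the innovation variance. Because the printed log-likelihoods are rounded, the results only approximately reproduce the reported AIC values (231 and 360).

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Non-organic ARIMA(2,1,1) with drift: ar1, ar2, ma1, drift + variance (assumed k = 5)
print(aic(-110, 5))  # 230 (reported: 231, from the unrounded log-likelihood)

# Organic ARIMA(0,1,1): ma1 + variance (assumed k = 2)
print(aic(-178, 2))  # 360 (reported: 360)
```

AIC values are only comparable between models fitted to the same series with the same order of differencing, so an ARIMA and a SARIMA fit that differ in differencing should be compared on forecast accuracy instead.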

4.1.2.5 Forecasted Prices ARIMA

Click to show code
# Create a table of the forecasted values 
forecast_table_arima <- tibble(
  Month = seq(as.Date("2023-01-01"), by = "month", length.out = 12),
  Non_Organic_Forecast = forecast_non_bio$mean,
  Bio_Forecast = forecast_bio$mean
)
#round the forecasted values
forecast_table_arima <- forecast_table_arima %>%
  mutate(across(c(Non_Organic_Forecast, Bio_Forecast), ~round(., 2)))
#show the forecasted values using reactable
reactable(
  forecast_table_arima,  
  highlight = TRUE,  # Highlight rows on hover
  defaultPageSize = 10,  # Display 10 rows per page
  paginationType = "numbers",  # Use numbers for page navigation
  searchable = TRUE,  # Make the table searchable
  sortable = TRUE,  # Allow sorting
  resizable = TRUE  # Allow column resizing
)
Click to show code
#plot the forecasted values
forecast_table_arima %>%
  pivot_longer(cols = c(Non_Organic_Forecast, Bio_Forecast), names_to = "Type", values_to = "Forecasted_Price") %>%
  ggplot(aes(x = Month, y = Forecasted_Price, color = Type)) +
  geom_line() +
  labs(title = "Forecasted Prices of Organic and Non-Organic Milk (ARIMA)",
       x = "Month",
       y = "Price",
       color = "Type") +
  theme_minimal()

We used the mean values of the forecasted prices for both organic and non-organic milk to create a table and plot the forecasted prices for the next 12 months. The table provides a detailed view of the forecasted prices, while the plot visualizes the trend of the forecasted prices over time.

4.1.2.6 Forecasted Prices SARIMA

Click to show code
# Create a table of the forecasted values 
forecast_table_sarima <- tibble(
  Month = seq(as.Date("2023-01-01"), by = "month", length.out = 12),
  Non_Organic_Forecast = forecast_non_bio_sarima$mean,
  Bio_Forecast = forecast_bio_sarima$mean
)
#round the forecasted values
forecast_table_sarima <- forecast_table_sarima %>%
  mutate(across(c(Non_Organic_Forecast, Bio_Forecast), ~round(., 2)))
#show the forecasted values using reactable
reactable(
  forecast_table_sarima,  
  highlight = TRUE,  # Highlight rows on hover
  defaultPageSize = 10,  # Display 10 rows per page
  paginationType = "numbers",  # Use numbers for page navigation
  searchable = TRUE,  # Make the table searchable
  sortable = TRUE,  # Allow sorting
  resizable = TRUE  # Allow column resizing
)
Click to show code
#plot the forecasted values
forecast_table_sarima %>%
  pivot_longer(cols = c(Non_Organic_Forecast, Bio_Forecast), names_to = "Type", values_to = "Forecasted_Price") %>%
  ggplot(aes(x = Month, y = Forecasted_Price, color = Type)) +
  geom_line() +
  labs(title = "Forecasted Prices of Organic and Non-Organic Milk (SARIMA)",
       x = "Month",
       y = "Price",
       color = "Type") +
  theme_minimal()

4.1.3 Exponential Smoothing

Click to show code
# Fit the ETS model
fit_non_bio_ets <- ets(df_producteur_ts_non_bio)
fit_bio_ets <- ets(df_producteur_ts_bio)

# Forecast the next 12 months
forecast_non_bio_ets <- forecast(fit_non_bio_ets, h = 12)
forecast_bio_ets <- forecast(fit_bio_ets, h = 12)

#plot the forecasted values
autoplot(forecast_non_bio_ets) + labs(title = "Forecasted Prices of Non-Organic Milk (ETS)")
autoplot(forecast_bio_ets) + labs(title = "Forecasted Prices of Organic Milk (ETS)")

Click to show code
# Create a table of the forecasted values
forecast_table_ets <- tibble(
  Month = seq(as.Date("2023-01-01"), by = "month", length.out = 12),
  Non_Organic_Forecast_ETS = forecast_non_bio_ets$mean,
  Bio_Forecast_ETS = forecast_bio_ets$mean
)
forecast_table_ets
#> # A tibble: 12 x 3
#>    Month      Non_Organic_Forecast_ETS Bio_Forecast_ETS
#>    <date>                        <dbl>            <dbl>
#>  1 2023-01-01                     75.1             92.6
#>  2 2023-02-01                     74.7             91.9
#>  3 2023-03-01                     73.1             88.1
#>  4 2023-04-01                     71.9             86.2
#>  5 2023-05-01                     71.6             85.6
#>  6 2023-06-01                     71.8             85.6
#>  7 2023-07-01                     74.2             90.5
#>  8 2023-08-01                     76.3             95.6
#>  9 2023-09-01                     77.2             97.3
#> 10 2023-10-01                     78.1             97.6
#> 11 2023-11-01                     78.5             96.9
#> 12 2023-12-01                     77.6             92.9

#plot the forecasted values
forecast_table_ets %>%
  pivot_longer(cols = c(Non_Organic_Forecast_ETS, Bio_Forecast_ETS), names_to = "Type", values_to = "Forecasted_Price") %>%
  ggplot(aes(x = Month, y = Forecasted_Price, color = Type)) +
  geom_line() +
  labs(title = "Forecasted Prices of Organic and Non-Organic Milk (ETS)",
       x = "Month",
       y = "Price",
       color = "Type") +
  theme_minimal()

Click to show code
# compare ARIMA and ETS forecast
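
Since ARIMA and ETS belong to different model classes, their AIC values are not directly comparable. One common alternative is to hold out the last observations, forecast them with each model, and compare error measures such as RMSE, MAE and MAPE. The sketch below shows these measures in plain Python; the held-out prices and both sets of forecasts are invented for illustration and are not results from this project.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error (actual values must be non-zero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical held-out prices and forecasts (illustration only)
actual = [75.0, 74.0, 73.0, 72.0]
forecasts = {
    "ARIMA": [74.0, 75.0, 72.0, 73.0],
    "ETS": [75.0, 74.0, 71.0, 72.0],
}

for name, predicted in forecasts.items():
    print(name, rmse(actual, predicted), mae(actual, predicted), round(mape(actual, predicted), 2))
```

In this made-up example both forecasts have an RMSE of 1.0, while the MAE is 1.0 for the first and 0.5 for the second, which illustrates why it is worth looking at more than one measure.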

4.2 Lait Equitable Analysis

4.2.1 Pareto Principle

The Pareto Principle, often known as the 80/20 rule, asserts that a small proportion of causes, inputs, or efforts usually leads to a majority of the results, outputs, or rewards. In a business context, it typically means that approximately 20% of customers, products, or locations account for about 80% of revenue, which helps in identifying and focusing on the most profitable parts of a business.

Evidence from Research:

Sales and Customer Concentration: Research has consistently shown that a significant portion of sales often comes from a minority of customers or products. For instance, an analysis across 22 different consumer packaged goods categories found an average Pareto ratio (PR) of 0.73, meaning that the top 20% of customers accounted on average for about 73% of sales (Kim, Singh, & Winer, 2017).

Decision Making and Resource Allocation: The Pareto Principle helps in decision-making by highlighting areas where the greatest impact can be achieved. For example, focusing on the top-performing products or customers can optimize resource allocation and maximize profits (Ivančić, 2014).

Market and Profit Concentration: Another study noted that a small number of customers are often responsible for a large portion of sales, which supports a strategic focus on these customers to boost profitability and efficiency (McCarthy & Winer, 2018).

Conclusion: Applying the Pareto Principle in a business context where a minority of sales drives the majority of revenue can lead to more focused and effective business strategies, optimizing efforts towards the most profitable segments. This approach not only simplifies decision-making but also enhances resource allocation, ultimately leading to increased profitability.

4.2.1.1 Steps

  1. Calculating the total sales across all locations for both 2022 and 2023.
  2. Ranking locations by sales to see the cumulative contribution of each location towards the total.
  3. Identifying the point where approximately 20% of the locations contribute to around 80% of the sales.
Click to show code
# Calculate the total sales for each year and the combined total to apply Pareto Principle
merged_sales_data['Combined Sales'] = merged_sales_data['Total Sales 2022'] + merged_sales_data['Total Sales 2023']

# Sort locations by combined sales
pareto_data = merged_sales_data.sort_values(by='Combined Sales', ascending=False)

# Calculate cumulative sales
pareto_data['Cumulative Sales'] = pareto_data['Combined Sales'].cumsum()

# Calculate the total of combined sales
total_combined_sales = pareto_data['Combined Sales'].sum()

# Calculate the percentage of cumulative sales
pareto_data['Cumulative Percentage'] = 100 * pareto_data['Cumulative Sales'] / total_combined_sales

# Find the point where about 20% of the locations contribute to approximately 80% of the sales
pareto_data['Location Count'] = range(1, len(pareto_data) + 1)
pareto_data['Location Percentage'] = 100 * pareto_data['Location Count'] / len(pareto_data)

# Plotting the Pareto curve
plt.figure(figsize=(12, 8))
cumulative_line = plt.plot(pareto_data['Location Percentage'], pareto_data['Cumulative Percentage'], label='Cumulative Percentage of Sales', color='b', marker='o')
plt.axhline(80.2, color='r', linestyle='dashed', linewidth=1)
plt.axvline(33.3, color='green', linestyle='dashed', linewidth=1)
plt.title('Pareto Analysis of Sales Across Locations')
plt.xlabel('Cumulative Percentage of Locations')
plt.ylabel('Cumulative Percentage of Sales')
plt.legend()
plt.grid(True)
plt.show()

According to this graph, 33.3% of Manor locations (9 of 27) account for about 80% of sales. This deviates from the typical Pareto 80/20 distribution, but it still shows a concentration of sales among a subset of stores.

4.2.1.2 Observations

We will identify the top 33.3% of locations based on their cumulative sales contribution. This means selecting the smallest number of locations that together account for at least 80% of the total sales.

The top-performing 33.3% of Manor locations that contribute to the majority of sales are:

Click to show code
# Number of locations in the top third (9 of the 27 locations)
top_third_index = round(len(pareto_data) / 3)

# Identifying the top 33.3% of stores contributing to at least 80% of sales
top_performing_stores = pareto_data.head(top_third_index)
top_performing_stores
#>               Location  Total Sales 2022  ...  Location Count  Location Percentage
#> 14             Monthey             49965  ...               1             3.703704
#> 20              Sierre             32212  ...               2             7.407407
#> 13      Marin-Epagnier             30228  ...               3            11.111111
#> 23               Vevey             25720  ...               4            14.814815
#> 10              Geneva             16068  ...               5            18.518519
#> 21                Sion             15208  ...               6            22.222222
#> 9             Fribourg             14792  ...               7            25.925926
#> 11            Lausanne             13186  ...               8            29.629630
#> 5   Chavannes-de-Bogis             14359  ...               9            33.333333
#> 
#> [9 rows x 8 columns]
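
The cutoff above is fixed at roughly one third of the locations. As a sketch of the rule stated earlier (the smallest number of locations that together reach at least 80% of total sales), the following plain-Python function derives the cutoff from the data instead. The function name `pareto_cutoff` and the sales figures are hypothetical and only illustrate the logic.

```python
def pareto_cutoff(sales_by_location, target_share=80.0):
    """Find the smallest set of top locations reaching target_share % of sales.

    Returns (number of locations, % of locations, cumulative % of sales).
    """
    ranked = sorted(sales_by_location.items(), key=lambda item: item[1], reverse=True)
    total = sum(sales for _, sales in ranked)
    if total <= 0:
        raise ValueError("total sales must be positive")

    running = 0
    for count, (_, sales) in enumerate(ranked, start=1):
        running += sales
        share = 100 * running / total
        if share >= target_share:
            return count, 100 * count / len(ranked), share
    return len(ranked), 100.0, 100.0


# Hypothetical combined sales per location (illustration only)
example_sales = {"A": 500, "B": 300, "C": 100, "D": 50, "E": 30, "F": 20}

n_locations, location_pct, sales_pct = pareto_cutoff(example_sales)
print(n_locations, round(location_pct, 1), sales_pct)  # 2 33.3 80.0
```

Applied to the combined sales per location, the same loop would return the number of stores to keep without hard-coding the percentage.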

4.2.2 Understanding Success Factors of Top-Performing Stores

4.2.2.1 Correlating Political Parties with Milk Sales

Click to show code
# Calculate correlation coefficients for each party
correlation_df <- data.frame(Party = c("PLR", "PS", "UDC", "Centre", "Verts"),
                             Correlation = sapply(merged_data[, 4:8], function(x) cor(x, merged_data$`2023`)))

# Print the correlation coefficients
print(correlation_df)
#>         Party Correlation
#> PLR       PLR      0.2796
#> PS         PS      0.1853
#> UDC       UDC     -0.1933
#> Centre Centre      0.0375
#> Verts   Verts      0.1708

# Create a matrix of plots for each party
party_plots <- lapply(names(merged_data)[4:8], function(party) {
  # Use the .data pronoun instead of the deprecated aes_string()
  ggplot(merged_data, aes(x = .data[["2023"]], y = .data[[party]])) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "blue") +
    labs(x = "Annual Sales", y = paste(party, "Party Presence (%)"), title = paste("Correlation:", party, "Party vs. Sales")) +
    theme_minimal()
})

# Arrange the plots in a matrix layout
matrix_plot <- gridExtra::grid.arrange(grobs = party_plots, ncol = 2)
matrix_plot
#> TableGrob (3 x 2) "arrange": 5 grobs
#>   z     cells    name           grob
#> 1 1 (1-1,1-1) arrange gtable[layout]
#> 2 2 (1-1,2-2) arrange gtable[layout]
#> 3 3 (2-2,1-1) arrange gtable[layout]
#> 4 4 (2-2,2-2) arrange gtable[layout]
#> 5 5 (3-3,1-1) arrange gtable[layout]
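
The coefficients printed above are Pearson correlations, which is the default method of R's `cor()`. All five are below 0.3 in absolute value, so the linear association between party presence and 2023 sales is weak. For reference, the following plain-Python sketch shows how such a coefficient is computed; the party shares and sales figures are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical party share (%) and annual sales per location (illustration only)
party_share = [10.0, 15.0, 20.0, 25.0, 30.0]
annual_sales = [12000, 9000, 15000, 11000, 16000]

print(round(pearson(party_share, annual_sales), 3))
```

A coefficient near 0 indicates little linear relationship, while values near 1 or -1 indicate a strong positive or negative one.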

4.2.2.2 Correlating Average Income with Milk Sales

Click to show code
# Create a scatterplot
ggplot(merged_df, aes(x = `Revenu/contribuable`, y = `2022`)) +
  geom_point(aes(color = Location)) +
  labs(x = "Income per taxpayer (Revenu/contribuable)", y = "Sales 2022", title = "Relationship between Income per Taxpayer and Sales in 2022") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()

# Create another scatterplot for 2023
ggplot(merged_df, aes(x = `Revenu/contribuable`, y = `2023`)) +
  geom_point(aes(color = Location)) +
  labs(x = "Income per taxpayer (Revenu/contribuable)", y = "Sales 2023", title = "Relationship between Income per Taxpayer and Sales in 2023") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()

5 Conclusion

  • Take home message
  • Limitations
  • Future work?